Abstract

Contextual information can have a substantial impact on the performance
of visual tasks such as semantic segmentation, object detection, and
geometric estimation. Data stored in Geographic Information Systems (GIS)
offers a rich source of contextual information that has been largely
untapped by computer vision. We propose to leverage such information for
scene understanding by combining GIS resources with large sets of
unorganized photographs using Structure from Motion (SfM) techniques. We
present a pipeline to quickly generate strong 3D geometric priors from 2D
GIS data using SfM models aligned with minimal user input. Given an image
resectioned against this model, we generate robust predictions of depth,
surface normals, and semantic labels. Despite the lack of detail in the
model, we show that the predicted geometry is substantially more accurate
than that of other single-image depth estimation methods. We then
demonstrate the utility of these contextual constraints for re-scoring
pedestrian detections, and use these GIS contextual features alongside
object detection score maps to improve a CRF-based semantic segmentation
framework, boosting accuracy over baseline models.

1. Introduction 

The problems of object detection and estimation of 3D geometry have
largely been pursued independently in computer vision. However, there
seem to be many good arguments for why these two sub-disciplines should
join forces. Accurate recognition and segmentation of objects in a scene
should constrain matching of features to hypothesized surfaces, aiding
reconstruction. Similarly, geometric information should provide useful
features and contextual constraints for object detection and recognition.
The use of detailed stereo depth has already proven to be incredibly
effective in the world of object detection, particularly for RGB-D
sensors in indoor scenes [17, 27]. More general formulations have
attempted to jointly integrate geometric reconstruction from multiple
images with scene and object recognition [5, 7, 23]. On the other hand,
works such as [22, 16, 33, 4] have focused on the role of context in a
single image, operating under the assumption that the camera and scene
geometry are unknown and must largely be inferred based on analysis of
monocular cues. This problem is quite difficult in general although some
progress has been made [21, 38, 10], particularly on indoor scenes of
buildings where the geometry is highly regular [37, 18, 19].

In this paper, we argue that a huge number of photographs taken in
outdoor urban areas are pictures of known scenes for which rich geometric
scene data exists in the form of GIS maps and other geospatial data
resources. Robust image matching techniques make it feasible to resection
a novel image against large image datasets to produce estimates of camera
pose on a world-wide scale [26]. Once a test photo has been precisely
localized, much of this contextual information can be easily backprojected
into the image coordinates to provide much stronger priors for
interpreting image contents. For the monocular scene-understanding purist,
this may sound like “cheating”, but from an applications perspective, such
strong context is already widely available or actively being assembled and
should prove hugely valuable for improving the accuracy of image
understanding.

To study the role of strong geometric context for image understanding, we
have collected a new dataset consisting of over six thousand images
covering a portion of a university campus for which relative camera pose
and 3D coordinates of matching points have been recovered using structure
from motion. We describe a method for aligning this geometric and
photometric data with 2D GIS maps in order to quickly build 3D polygonal
models of buildings, sidewalks, streets and other static structures with
minimal user input (Section 3). This combined geosemantic context dataset
serves as a reference against which novel test images can be resectioned,
providing a rich variety of geometric and semantic cues for further image
analysis. We develop a set of features for use in detection and
segmentation (Sections 4 and 5) and experimentally demonstrate that these
contextual cues offer significantly improved performance with respect to
strong baselines for scene depth estimation, pedestrian detection and
semantic segmentation (Section 6).

Figure 1: System overview. Panels: (1) Model construction: a rich 3D model
is built using minimal input with an image-assisted interface. (2) Test
image resection: geosemantic priors are collected by model backprojection.
(3) Context quantization: features are extracted and fed into a detection
or segmentation pipeline. 2D GIS and SfM data are aligned to build a 3D
model with minimal effort using an image-assisted Sketchup plug-in. This
fused geocontext provides a basis for efficiently transferring rich
geometric and semantic information to a novel test image where it is used
to improve performance of general scene understanding (depth, detection,
and segmentation).

2. Contribution and Related Work 

The contribution of our work is twofold. First, we demonstrate a pipeline
that performs precise resectioning of test images against GIS map data to
generate geosemantic context applicable to multiple scene understanding
tasks. Second, we show that simple cues derived from the GIS model can
provide significantly improved performance over baselines for depth
estimation, pedestrian detection and semantic segmentation. We briefly
discuss the differences between our results and closely related work.
Figure 1 shows an overview of the system pipeline.

GIS for image understanding The role of GIS map data in automatically
interpreting images of outdoor scenes appears to have received relatively
little attention in computer vision. While detailed GIS map data is used
extensively in analysis of aerial images [42, 43, 3 ], only a handful of
papers have exploited this resource to study scene understanding from a
ground-level perspective. Recently, a few groups have looked at using GIS
data and multi-view geometry for improving object recognition. [30]
introduced a geographic context re-scoring scheme for car detection based
on street maps. Ardeshir et al. used GIS data as a prior for static object
detection and camera localization [4] and for performing segmentation of
street scenes [3]. Compared to these works, our pipeline utilizes full 6D
camera pose estimation and a richer scene model derived from the GIS map
that supports not only improved detection rescoring but also depth and
semantic label priors.

Perhaps the most closely related work to our approach is [44], which uses
a CRF to simultaneously estimate depth and scene labels using a strong 3D
model. However, [44] assumes that both camera localization and a CAD model
of a scene with relevant categories (e.g. trees) are provided as inputs.
In contrast, we address the construction of a 3D scene model by combining
image and map data as well as test-time camera localization using model
alignment and resectioning techniques. Interestingly, we also show that
lifted GIS data can improve segmentation accuracy even for semantic labels
that are not present in the GIS map model (e.g. trees, retaining walls).

Contextual detection rescoring Rescoring object detector outputs based on
contextual and geometric scene constraints has been suggested in a wide
variety of settings. For example, Hoiem et al. [20] used monocular
estimation of a ground-plane to rescore car and pedestrian detections,
improving the performance of a contemporary baseline detector [8].
Similarly, [4] showed improvement on hard-to-detect objects such as fire
hydrants. However, the benefit of scene geometry constraints appears to be
substantially less when a more robust baseline detector such as DPM [11]
is used. For example, [ ( ] reported that monocular ground-plane
constraints failed to improve the performance of a DPM pedestrian
detector. Similarly, [30] reported that geometric rescoring based on road
maps did not improve performance of a DPM car detector (although rescoring
did improve detection for a weaker detector that had been trained to
predict viewpoints). In contrast to this previous work, we use a richer
rescoring model that allows for non-flat supporting surfaces and
integrates additional geosemantic cues, resulting in a significant boost
(5% AP) in pedestrian detection even when rescoring a strong baseline
model.

3. Lifting GIS Maps 

In this section, we describe the construction of a dataset for exploring
the use of strong GIS-derived geometric context. We focus on novel aspects
of this pipeline which include: aligning GIS and SfM data, a user-friendly
toolbox to lift 2D geosemantic information provided by a map into 3D
geometric context models, and model-assisted camera resectioning to allow
quick and accurate localization of a test image with respect to the
context model.

Image database acquisition We collected a database of 6402 images
covering a large area (the engineering quad) of a university campus.
Images were collected in a systematic manner using a pair of
point-and-shoot cameras attached to a monopod. We chose locations so as to
provide approximately uniform coverage of the area of interest. Images
were generally collected during break periods when there were relatively
few people present (although some images still contain pedestrians and
other non-rigid objects).

Running off-the-shelf incremental structure from motion (e.g., [2, 40]) on
the entire dataset produces a 3D structure that is qualitatively
satisfying but often contains metric inaccuracies. In particular, there
can be a significant drift over the whole extent of the recovered model
that makes it impossible to globally align the model with GIS map data.
These reconstruction results were highly dependent on the quality of the
initial matches between individual camera pairs. However, we found that
the incremental SfM pipeline implementation in OpenMVG [32], which uses
the adaptive thresholding method of [: ], yielded superior results in
terms of the accuracy and number of recovered cameras and points. Our
final model included 4929 successfully bundled images.

Global GIS-structure alignment We obtained a 2D GIS map of the campus
maintained by the university’s building management office. The map was
originally constructed from aerial imagery and indicates polygonal regions
corresponding to essential campus infrastructure tagged with semantic
labels including building footprints, roadways, fire lanes, lawns, etc.

We would like to align our SfM reconstruction with this model. One
approach is to leverage existing sets of geo-referenced images. For
example, [30] used aerial LiDAR and Google Street View images with known
geo-coordinates in order to provide an absolute coordinate reference.
While this is practical when the 3D GIS data has already been generated,
we start from a flat 2D model depicting an environment for which no
precisely georegistered images were readily available.

Figure 2: We developed a custom Sketchup plug-in that imports camera
parameters computed during bundle adjustment. The user easily models the
3D geometry by extruding, tilting, and carving the 2D map data until it
naturally aligns with the image.

To quickly produce an initial rough alignment, we project the point cloud
recovered using SfM onto a ground-plane and have a user select 3 or more
points (typically building corners as they are more visible). We run
Procrustes alignment to match the user-selected points to the real
coordinates in our 2D GIS data. This global 2D alignment is sufficient to
register the SfM model in geographic coordinates for the initial
construction of a 3D model. Once the 3D GIS model is defined (see below),
we use an iterative closest point approach to automatically refine the
alignment of the GIS model and SfM point cloud.
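As an illustration of the initial alignment step, the following minimal
sketch fits a 2D similarity transform (rotation, uniform scale,
translation) in the least-squares sense from a few point pairs. It is one
common form of Procrustes alignment, not necessarily the exact variant used
here; the function names are hypothetical, points are treated as complex
numbers, and reflections are assumed not to occur.

```python
def fit_similarity_2d(src, dst):
    """Least-squares 2D similarity transform mapping src onto dst.

    src, dst: equal-length lists of (x, y) tuples, e.g. user-selected
    building corners in the projected SfM cloud and in the GIS map.
    Returns complex numbers (a, b) such that dst ~ a * src + b, where
    abs(a) is the scale and the phase of a is the rotation angle.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matching point pairs")
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    # Closed-form minimizer of sum |a * (p - p_mean) - (q - q_mean)|^2.
    num = sum((pi - p_mean).conjugate() * (qi - q_mean)
              for pi, qi in zip(p, q))
    den = sum(abs(pi - p_mean) ** 2 for pi in p)
    if den == 0:
        raise ValueError("source points must not all coincide")
    a = num / den
    b = q_mean - a * p_mean
    return a, b


def apply_similarity_2d(a, b, points):
    """Apply the fitted transform to a list of (x, y) tuples."""
    moved = [a * complex(x, y) + b for x, y in points]
    return [(z.real, z.imag) for z in moved]
```

With three or more well-spread correspondences the fit is overdetermined,
so small clicking errors are averaged out rather than reproduced exactly.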

Image-assisted 3D model construction We developed a custom plug-in for the
3D modeling tool Sketchup [ ] to allow efficient user-assisted “lifting”
of 2D GIS map data into a full 3D model. We imported the GIS 2D polygons
with their corresponding semantic labels into the workspace. The user is
then presented with a choice of images to load from the globally aligned
SfM model. When an image is selected, the 3D model view is adjusted to
match the recovered camera extrinsic and intrinsic parameters and the
image is transparently overlaid, providing an intuitive visualization as
shown in Figure 2. The user can then extrude flat 2D polygons (e.g.,
building footprints) to the appropriate height so that the backprojected
view into the selected camera matches well with the overlaid image.

A full 3D mesh model can be easily constructed starting from the aligned
2D map by extruding buildings up, carving stairs down, tilting non-fully
horizontal surfaces, etc. using standard Sketchup modeling tools and
guided by the image overlay. The user can also create additional geometry,
add semantic labels, or correct the original data. With the assistance of
these aligned camera views, constructing a fairly detailed model in
Sketchup covering 10 buildings took approximately 1-2 hours. This can be
considered an offline task since the modeling effort is performed only
once, as buildings are largely static over time.

GIS-assisted test image resectioning To estimate camera pose at test time,
we resection each test image against the SfM image dataset based on 2D-3D
matching correspondences using a RANSAC-based 3-point absolute pose
procedure [2- ]. We developed several techniques for improving the
accuracy of this resectioning. As the bundled image dataset grows to cover
larger areas, resectioning accuracy generally falls since the best match
for a given feature descriptor in the test image is increasingly likely to
be a false positive. To leverage knowledge of spatial locality and scale
to large datasets, we partitioned the bundled cameras into k = 10 clusters
using k-means over database camera positions. Points in the bundled model
were included in any camera cluster in which they were visible, yielding
10 non-disjoint clusters of points.
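The construction of the non-disjoint point clusters can be sketched as
follows, assuming the k-means labels for the cameras are already available.
The function name and the two dictionary inputs are hypothetical
conveniences for illustration and are not part of any SfM library.

```python
def points_per_cluster(camera_cluster, visibility):
    """Group 3D points by the camera clusters that observe them.

    camera_cluster: dict mapping camera id -> cluster index (for example
        the label assigned by k-means over camera positions).
    visibility: dict mapping point id -> iterable of camera ids in which
        the point was observed.
    Returns a dict mapping cluster index -> set of point ids. A point seen
    from cameras in several clusters appears in each of them, so the
    resulting sets overlap.
    """
    clusters = {}
    for point_id, cameras in visibility.items():
        for camera_id in cameras:
            cluster = camera_cluster[camera_id]
            clusters.setdefault(cluster, set()).add(point_id)
    return clusters
```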

Each test image was resectioned independently against each spatial cluster
and the best camera pose estimate among the clusters was selected using
geometric heuristics based on the GIS model. We measured the distance to
ground (height) and camera orientation with respect to the gravity
direction in the GIS model. All camera poses below the ground plane,
higher than 4 meters, or tilted more than 30 degrees were discarded. If a
camera pose was geometrically reasonable in more than one cluster, we
selected the estimate with the highest number of matched inliers.
Incorporating these heuristic constraints increased the proportion of
correctly resectioned test images substantially (from 49% to 59%). This
improvement would not be possible without the GIS-derived 3D model since
the ground elevation varies significantly and cannot be captured by a
simple global threshold on the camera coordinates.
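The selection rule above can be written down in a few lines. This is a
minimal sketch under the assumption that each per-cluster pose estimate has
already been reduced to a height above the local GIS ground surface, a tilt
angle relative to gravity, and an inlier count; the dictionary keys are
hypothetical names chosen for this example.

```python
def select_pose(candidates, max_height_m=4.0, max_tilt_deg=30.0):
    """Pick the most plausible camera pose among per-cluster estimates.

    candidates: list of dicts with the (hypothetical) keys
        'height_m'  camera height above the GIS ground surface below it,
        'tilt_deg'  camera tilt relative to the gravity direction,
        'inliers'   number of matched RANSAC inliers.
    Poses below ground, higher than max_height_m, or tilted by more than
    max_tilt_deg are discarded; among the rest the pose with the most
    inliers wins. Returns None when no pose is geometrically reasonable.
    """
    plausible = [
        c for c in candidates
        if 0.0 <= c["height_m"] <= max_height_m
        and abs(c["tilt_deg"]) <= max_tilt_deg
    ]
    if not plausible:
        return None
    return max(plausible, key=lambda c: c["inliers"])
```

Because the height is measured against the local ground surface of the 3D
model rather than a global plane, the same thresholds can be applied on
terrain with varying elevation.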

Resectioning of a test image produces immediate predictions of scene
depth, surface orientation, and semantic labels at every image pixel. It
is interesting to compare these to previous works that attempt to estimate
such geometric or semantic information from a single image. By simply
resectioning the picture against our 3D model we are immediately able to
make surprisingly accurate predictions without running a classifier on a
single image patch! In the remainder of the paper, we discuss how to
upgrade these “blind” predictions by incorporating backprojected model
information into standard detection and segmentation frameworks.

4. Strong Geometric Context for Detection 

Estimating camera pose of a test image with respect to a 3D GIS model aids
in reasoning about the geometric validity of hypothesized object
detections (e.g. pruning false positives in unlikely locations or boosting
low-scoring candidates in likely regions). We describe a collection of
features that capture these constraints and use them to improve detection
performance. We incorporate these features into a pedestrian detector by
training an SVM with geometric context features (GC-SVM) that learns to
better discriminate object hypotheses based on the geosemantic context of
a candidate detection.

3D geometric context Let a candidate 2D bounding box in a test image I
have an associated height in pixels $h_{im}$. We use a deformable part
model that consists of a mixture over three different template filters:
full-body, half upper body, and head. We set the image detection height
$h_{im}$ based on which mixture fires as 1, 2 or 3 times the bounding box
height respectively.

If we assume that the object is resting on a horizontal surface and the
base of the object is visible, then we can estimate the 3D location the
object is occupying. We find the depth at intersection $z_i$ of the object
with respect to the camera by shooting a ray from the camera center
through the bounding box’s “feet” and intersecting it with the 3D model.
Importantly, unlike many previous works, the ground is not necessarily a
plane (e.g., our model includes ground at different elevations as well as
stairs and ramps). Given camera focal length $f$, we can estimate the
height in world coordinates by the following expression:

    $h_i = \frac{z_i}{f} h_{im}$    (1)

Unfortunately, the object’s “feet” might not be visible at all times (e.g.
a low wall is blocking them). Let $h_\mu$ be the “physical height” of an
average bounding box (i.e., the height of an average human). We collect
all possible intersections with the model (a blocking wall and the ground
plane behind it) and choose the $h_i$ that minimizes $(h_i - h_\mu)^2$.

An alternative way to hypothesize an object’s height uses its inverse
relation with depth. We can estimate an object’s depth $z_o$ based on the
expected average human height $h_\mu$. A bounding box of size $h_{im}$ in
the image has an expected distance

    $z_o = h_\mu \frac{f}{h_{im}}$    (2)

from the camera. Given $z_o$, we produce a second height estimate $h_o$ by
tracing a ray through the center top of the detection to a depth $z_o$ and
then measuring the distance from this estimated head position to the
ground plane.
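The two pinhole relations used above, together with the choice among
several ray intersections, amount to the arithmetic sketched below. The
function names and the 1.7 m default are illustrative; ray casting against
the 3D model is assumed to have produced the candidate depths already.

```python
def world_height(z_i, f, h_im):
    """Equation (1): metric height of a box of h_im pixels at depth z_i,
    for a camera with focal length f expressed in pixels."""
    return z_i / f * h_im


def expected_depth(h_mu, f, h_im):
    """Equation (2): depth at which an object of metric height h_mu
    appears h_im pixels tall."""
    return h_mu * f / h_im


def best_height(depths, f, h_im, h_mu=1.7):
    """Among candidate intersection depths along the ray through the
    feet (e.g. an occluding wall and the ground behind it), return the
    height estimate closest to the expected human height h_mu."""
    heights = [world_height(z, f, h_im) for z in depths]
    return min(heights, key=lambda h: (h - h_mu) ** 2)
```

For instance, with f = 1000 pixels and a 170-pixel detection, a depth of
10 m corresponds to a height of about 1.7 m, and the expected depth of a
1.7 m person is about 10 m.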

For each height estimator $h_i$, $h_o$ we also extract a corresponding
semantic label associated with the GIS-model polygon where the feet
intersect and record binary variables $w_i$, $w_o$ indicating whether the
polygon has a “walkable” semantic label and $n_i$, $n_o$ indicating
whether it has a horizontal surface normal. Our feature vectors for each
estimate are given by:

    $F_i = [v_i (h_i - h_\mu)^2, w_i, n_i, (1 - v_i)]$    (3)

    $F_o = [v_o (h_o - h_\mu)^2, w_o, n_o, (1 - v_o)]$    (4)

where $h_\mu = 1.7$ m is the average human height and $v_i$, $v_o$ are
binary variables indicating whether the corresponding height could be
measured. For example, if the ray through the feet used to compute $z_i$
does not intersect the model or if the depth estimate is behind the model
surface, we mark the estimate as invalid and zero out the feature vector.
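A minimal sketch of how one such four-element feature could be assembled
is given below. It assumes that, for an invalid estimate, the height,
walkable, and normal entries are set to zero while the final entry keeps
its role as an invalidity flag; this is one reading of the description, and
the function name is hypothetical.

```python
def height_feature(h, walkable, horizontal, valid, h_mu=1.7):
    """Build [v * (h - h_mu)^2, w, n, 1 - v] for one height estimate.

    h: estimated metric height (ignored when valid is False).
    walkable: True if the supporting GIS polygon has a walkable label.
    horizontal: True if the supporting surface normal is horizontal
        ground.
    valid: True if the height could be measured.
    Assumption: invalid estimates contribute zeros everywhere except the
    trailing invalidity flag.
    """
    if not valid:
        return [0.0, 0.0, 0.0, 1.0]
    return [(h - h_mu) ** 2, float(walkable), float(horizontal), 0.0]
```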

2D geosemantic context In addition to the height and foot location, we
extract geosemantic context by backprojecting model semantic labels and
surface normals into the image plane and looking at their distribution
inside the object bounding box. For each bounding box $b$, we hypothesize
a full-body bounding box given the detection’s mixture as previously
mentioned. We then split this box vertically into 3 parts (top, center,
bottom) and collect normalized histograms $H_b$ of the distribution of
semantic labels and surface normals in each subregion. We account for 5
GIS labels (building, plants, pavement, sky, unknown) and 4 discretized
surface normal directions (ground, ceiling, wall, none).

    $H_b = [H_{top}, H_{center}, H_{bottom}]$    (5)

To allow the learned SVM weights to depend on the original detector
mixture component $m$, we construct an expanded feature vector from these
histograms:

    $F_b = [\delta(m = 1) H_b, \delta(m = 2) H_b, \delta(m = 3) H_b]$    (6)

where $m \in \{1, 2, 3\}$ indicates a full-body, half-upper, or head
mixture respectively. For example, this allows us to model that upper-body
detections are correlated with the presence of some vertical, non-walkable
surface such as a wall occluding the lower body.

We train an SVM to rescore each detection using a final feature vector
that includes the detector score $s$ along with the concatenated context
features

    $F = [s, F_i, F_o, F_b]$
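The mixture-dependent expansion and the final concatenation can be sketched
as plain list operations. The function names are hypothetical. Taking the
counts stated above (3 subregions, 5 labels, 4 normal directions), each
histogram would have 3 x (5 + 4) = 27 entries, which gives a 90-dimensional
final vector under this layout; that dimension is derived here and is not
quoted from the text.

```python
def expand_by_mixture(hist, m, num_mixtures=3):
    """Place hist in the block of mixture component m (1-based) and
    zeros in the other blocks, so that a linear classifier can learn a
    separate weight vector for each component."""
    if not 1 <= m <= num_mixtures:
        raise ValueError("mixture index out of range")
    expanded = []
    for k in range(1, num_mixtures + 1):
        expanded.extend(hist if k == m else [0.0] * len(hist))
    return expanded


def rescoring_feature(score, f_i, f_o, hist, m):
    """Concatenate the detector score, both height features, and the
    mixture-expanded histogram into a single feature vector."""
    return [score] + list(f_i) + list(f_o) + expand_by_mixture(hist, m)
```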

5. Strong Geometric Context for Segmentation 

The geometric and semantic information contained in the GIS data and
lifted into the 3D GIS model can aid in reasoning about the geometric
validity of class hypotheses (e.g., horizontal surfaces visible from the
ground are not typically buildings). We describe methods for using such
constraints to improve semantic segmentation performance. We follow a
standard CRF-based labeling approach from Gould et al. [1 ] which uses an
augmented set of image features from [39]. We explore simple ways to
enhance this set of features using GIS data and study its influence on
semantic labeling accuracy.

Figure 3: Qualitative depth comparison (columns: Ground Truth, Make3D [38],
DNN [10], GIS). Our GIS backprojected depth map is shown in the last
column. While it lacks many details such as foliage and pedestrians which
are not included in our coarse GIS-based 3D model, simply backprojecting
depth provides substantially more accurate estimation than existing
monocular approaches.

GIS label prior distributions The GIS-derived model provides an immediate
estimation of pixel labels based on the 4 semantic labels in the original
GIS map (building, plants, pavement, and sky). If a camera pose is known,
we can backproject the model into the image plane and transfer the polygon
label in the GIS model to the projected pixel. However, camera pose
estimation is not perfect and the estimate might deviate slightly from the
ground truth pose. In order to account for slight camera pose
inaccuracies, we define a 16-dimensional feature descriptor to softly
handle these cases. Given an image $I$, a pixel $x \in I$, and a
backprojected GIS semantic label $g(x)$, we define the feature

    $h_{r,k}(x) = \frac{1}{N} \sum_{y : \|y - x\| < r} [g(y) = k]$    (7)

as the normalized count of class $k$ pixels in a circular disc of radius
$r$ around $x$, where $N$ is the number of pixels in the disc. In our
experiments, we use four radii $r$, chosen to correspond to camera pose
angular errors of 0, 1, 3, and 5 degrees; with 4 labels this gives the 16
dimensions.
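A direct, unoptimized sketch of the disc-based label prior for a single
pixel and a single radius is shown below. The function name is
hypothetical, labels are assumed to be integer class ids stored as a list
of rows, and the disc is taken to include its boundary so that a radius of
zero reduces to the pixel itself; that last choice is an assumption made
for this example.

```python
import math


def label_prior(labels, x, y, radius, num_classes):
    """Normalized histogram of backprojected labels in a disc.

    labels: 2D list (rows of integer class ids in [0, num_classes)).
    (x, y): column and row of the query pixel, assumed inside the image.
    radius: disc radius in pixels; pixels at distance <= radius count.
    Returns a list of num_classes fractions that sum to 1. Pixels outside
    the image are ignored, so N is the number of in-image disc pixels.
    """
    height, width = len(labels), len(labels[0])
    reach = int(math.floor(radius))
    counts = [0] * num_classes
    total = 0
    for yy in range(max(0, y - reach), min(height, y + reach + 1)):
        for xx in range(max(0, x - reach), min(width, x + reach + 1)):
            if (xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2:
                counts[labels[yy][xx]] += 1
                total += 1
    return [c / total for c in counts]
```

Evaluating this at several radii and concatenating the results yields a
descriptor that degrades gracefully as the projected label boundaries shift
with pose error.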

GIS surface normal distributions In a similar manner, a surface normal can
be quickly estimated for any pixel by backprojecting the 3D model into the
camera plane. Surface normals can be discriminative of certain classes
like pavement, roads, buildings, etc. Following the same structure as in
equation 7, we define the 12-dimensional feature $h^n_r(x)$ as

    $h^n_{r,k}(x) = \frac{1}{N} \sum_{y : \|y - x\| < r} [n(y) \in N_k]$    (8)

where $n(y)$ is the backprojected surface normal at pixel $y$ and $N_k$ is
one of 3 possible surface orientation bins: horizontal (ground),
horizontal (ceiling), vertical (wall).

GIS Depth features Depth can also be efficiently estimated from a 3D model
when a camera pose is known. Following other methods like [46, 35], we
extract HOG features [8] to encode depth variations. We did find a
substantial gain in certain categories when adding these features into the
model (e.g. wall).

Figure 4: Quantitative comparison of depth estimation methods: Make3D
[38], Deep Neural Network [10], and GIS backprojection. We list the
proportion of depths within a specified maximum allowed relative error
$\delta = \max(\frac{d_{gt}}{d_{est}}, \frac{d_{est}}{d_{gt}})$, where
$d_{gt}$ is the ground truth depth and $d_{est}$ is the estimated depth at
some point. The DNN model predictions were re-scaled to match our ground
truth data since the model provided is adapted only for indoor scenes.

DPM as a context feature Inspired by other works that try to create
segmentation-aware detectors [12, 25, 29, 34], we also incorporate the
outputs of category-specific object detectors in our segmentation model.
To do so, we collect the scores of a DPM detector for an object category
$c$ and generate a DPM feature map $h^c$ by assigning to every pixel the
maximum score of any of the candidate detection boxes intersecting the
given pixel. Let $\Omega_c$ be the set of candidate detections and $b_i$,
$s_i$ the bounding box and score for the $i$th detection, then

    $h^c(x) = \max_{i \in \Omega_c} (s_i \cdot [x \in b_i])$    (9)
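The per-pixel maximum over detection scores can be sketched literally as
below. The function name and the box convention (left and top inclusive,
right and bottom exclusive) are assumptions for this example, and a map
with no detections is assumed to be all zeros.

```python
def dpm_score_map(width, height, detections):
    """Per-pixel maximum of detection scores for one object category.

    detections: list of ((x0, y0, x1, y1), score) pairs, with boxes given
        in pixel coordinates (x0, y0 inclusive; x1, y1 exclusive).
    Each detection contributes its score inside its box and 0 outside,
    and the map stores the maximum contribution at every pixel. Returns a
    list of rows of floats.
    """
    score_map = []
    for y in range(height):
        row = []
        for x in range(width):
            values = [
                s if (x0 <= x < x1 and y0 <= y < y1) else 0.0
                for (x0, y0, x1, y1), s in detections
            ]
            row.append(max(values, default=0.0))
        score_map.append(row)
    return score_map
```

This loop is quadratic in image size times detections; a practical version
would rasterize each box once, but the result is the same.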

6. Experimental results 

In our experiments, we started with a test set comprising 570 pictures
taken in the area covered by the 3D model. These images were collected
over different days and several months after the initial model dataset was
acquired. Of these images, 334 images (59%) were successfully resectioned
using the cluster-based approach from Section 3. This success rate
compares favorably with typical success rates for incremental bundle
adjustment (e.g., [30] register 37% of their input images) even though our
criterion for correctness is more stringent (we manually scored test
images rather than relying on number of matching feature points). To
evaluate performance of detection, we annotated resectioned images with
ground-truth bounding boxes for 1484 pedestrians using the standard PASCAL
VOC labeling practice including tags for truncated and partially occluded
cases. We used 167 images for training our geometric context rescoring
framework and left the remaining 167 for testing. We also manually
segmented 305 of these images using the segmentation tool provided by [28]
and labeled each segment with one of 9 different semantic categories. We
split the segmentation data into 150 images for training and 155 for
testing.

6.1. Monocular Depth Estimation 

To verify that the coarse-scale 3D GIS model provides useful geometric
information, despite the lack of many detailed scene elements such as
trees, we evaluated resectioning and backprojection as an approach for
monocular depth estimation. While our approach is not truly monocular
since it relies on a database of images to resection the camera, the test
time data provided is a single image and constitutes a realistic scenario
for monocular depth estimation in well photographed environments.

To establish a gold-standard estimate of scene depth, we scanned 14
locations of the area covered by our dataset using a Trimble GX3D
terrestrial laser scanner. We took the scans at resolutions between 5 and
12 cm in order to keep the total scanning time manageable, resulting in
roughly half a million 3D points per scan. We mounted a camera on top of
the laser scanner and used the camera focal length to project the
laser-based 3D point cloud onto the camera image plane, interpolating
depth values to obtain a per-pixel depth estimate. We then resectioned the
test image and synthesized the depth map predicted by our 3D GIS model.

Figure 5: Accuracy of predicted depth estimates compared to gold-standard
provided by a laser-scanner for 14 images. Top histogram shows the
distribution (log counts) of absolute depth errors between Make3D, DNN and
depth computed from resectioning and backprojecting GIS derived 3D model.
The bottom plot shows the distribution of relative error
$|d_{est} - d_{gt}| / d_{gt}$.

Figure 4 shows quantitative results of our GIS backprojection depth
estimation against other single-image depth approaches for the 14 scan
dataset. We used the provided pre-trained models included in [38, 10] as
baselines for comparison. Since the pretrained DNN model [10] was only
available for indoor scenes (trained from Kinect sensor data), we
estimated a scaling factor and an offset for the output that minimized the
relative error over the 14 images. While the optimally scaled pretrained
DNN model is probably suboptimal, it is not possible to retrain the model
without collecting substantially larger amounts of training data using
specialized hardware (a laser scanner). In contrast, our proposed approach
of simply backprojecting GIS data greatly outperforms the image-based
predictions while using only camera pose and a map, with no training data
required!
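The threshold-style accuracy reported for depth can be computed as in the
sketch below. It assumes that the relative error is the symmetric ratio
max(d_gt / d_est, d_est / d_gt), a common form for this kind of metric, and
that a depth counts as correct when the ratio does not exceed the
threshold. The function name is hypothetical.

```python
def fraction_within(d_est, d_gt, threshold):
    """Fraction of depth estimates whose symmetric ratio to the ground
    truth, max(d_gt / d_est, d_est / d_gt), is at most threshold.

    d_est, d_gt: equal-length sequences of positive depths in the same
        units.
    """
    if len(d_est) != len(d_gt) or not d_est:
        raise ValueError("need two non-empty sequences of equal length")
    ratios = [max(g / e, e / g) for e, g in zip(d_est, d_gt)]
    return sum(1 for r in ratios if r <= threshold) / len(ratios)
```

A ratio of 1 means a perfect estimate, and the measure penalizes over- and
under-estimation by the same factor equally.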

6.2. Object Detection 

We evaluated our geometric and semantic object rescoring scheme applied to
the widely used deformable part model detector (DPM) [11] implemented in
[13]. We used a standard non-maxima suppression ratio of 0.5 in the
intersection over union overlap.

Figure 6: Geometric context aids in recognizing discriminative 3D and 2D
features that improve the average precision in pedestrian detection. Our
GC-SVM obtained a 5% boost in AP with respect to the standard DPM model.

From the 167-image training split, we collected the set of geometric
context features F described in Section 4 along with the appropriate label
indicating whether the bounding box was a true or false positive. We
trained a linear SVM classifier using the implementation of [6]. In our
experiments, we set the regularization C = 4 using 5-fold
cross-validation. To accommodate class imbalance and maximize average
precision, we used cross-validation to set the relative penalty for
misclassified positives to be 4x larger than for negatives.

We benchmarked the trained Geometric Context SVM classifier using the same
features F collected for the 167-image test split. While the standard DPM
detector score provided a baseline average precision (AP) of 0.457, our
GC-SVM model achieved an AP of 0.507. It is important to note that
previous attempts to incorporate GIS-based geometric rescoring into a DPM
classifier provided little to no improvement. In [30], 3D context between
cars and streets allowed for improved geometric reasoning about car
orientation but only small gains in detection performance. In fact, their
final VP-LSVM and VP-WL-SSVM models had lower average precision than a
baseline DPM model.

6.3. Semantic Segmentation 

We built two baseline segmentation models using the CRF-based multi-class
segmentation code provided by [14]. We trained one model using our
training labeled set of engineering quad images (CRF ENGQ) as a
scene-specific model. We also trained a generic model using images
collected from the online SUN dataset [45] by querying for multiple
categories in the SUN label set that are semantically equivalent to the 9
categories labeled in our engineering quad dataset (CRF SUN). Finally, we
added our GIS and DPM features to the ENGQ model and evaluated their
effects on segmentation performance.

Figure 7: Detection results at 0.6 recall. Geosemantic context
successfully removes high DPM score false positives at unlikely places
without adding too many low DPM score ones at coherent regions. This
successful trade-off benefits the performance at almost all levels of
recall.

Detectors Improve Segmentation We collected DPM score features
as described previously for two objects of interest: pedestrian
and bicycle. Figure 8 shows the influence of adding these
features to the model (+DPM rows). In the ENGQ model, we raised
pedestrian segmentation accuracy from 0.400 to 0.583 and
bicycle accuracy from 0.370 to 0.482. We also improved
pedestrian segmentation in the generic SUN model.

It is interesting to note how these detection priors interact
with geosemantic information. In the presence of geometric
context, pedestrian segmentation was slightly hurt. However,
sitting pedestrian segmentation was boosted from almost 0
accuracy to 0.108. Bicycle segmentation also benefited from
geometric context, rising from 0.458 to 0.530.

GIS-aware Segmentation To evaluate the influence of GIS
features alone, we first trained a CRF model without the image
features from [15], using only the contextual features
described in Section 5 (GIS CRF). This yielded relatively good
accuracy for the 4 labels present in the GIS model, but poor
results for many others. This is quite natural since our GIS
model does not include detailed elements such as benches, and
provides no information about what pixels might be a bike or
pedestrian on any given day. However, this model still improves
on a simple “blind” backprojection of the GIS labels.

Model                     Overall  building  plants  pavement  sky    ped.   ped. sit  bicycle  bench  wall
Image CRF (SUN)           0.309    0.767     0.581   0.863     0.872  0.007  0.000     0.000    0.000  0.000
+DPM                      0.323    0.774     0.586   0.864     0.873  0.129  0.000     0.001    0.000  0.000
GIS Label Backprojection  0.242    0.688     0.099   0.810     0.581  0.000  0.000     0.000    0.000  0.000
GIS CRF                   0.290    0.730     0.316   0.847     0.705  0.000  0.000     0.000    0.000  0.014
Image CRF (ENGQ)          0.561    0.917     0.886   0.925     0.949  0.400  0.010     0.370    0.241  0.348
+GIS                      0.584    0.937     0.894   0.936     0.963  0.394  0.060     0.385    0.208  0.481
Depth                     0.569    0.935     0.895   0.937     0.957  0.390  0.011     0.358    0.179  0.455
Labels                    0.575    0.938     0.892   0.933     0.966  0.374  0.064     0.366    0.221  0.419
Normals                   0.568    0.935     0.893   0.933     0.961  0.389  0.007     0.385    0.192  0.418
+DPM                      0.590    0.920     0.892   0.929     0.947  0.583  0.013     0.482    0.184  0.360
+DPM+GIS                  0.627    0.936     0.894   0.938     0.961  0.568  0.108     0.520    0.245  0.472

Figure 8: Quantitative segmentation results for models trained
with generic (SUN) and scene-specific (ENGQ) data. Accuracy is
measured using the PASCAL intersection-over-union (IOU)
protocol. Adding geosemantic (+GIS) and detection (+DPM)
features outperformed the baseline models. Combining both
methods gave the best overall results in the scene-specific
model, although some classes did not achieve their best
accuracy individually.

On the other hand, combining these GIS features with standard
image features gave a significant benefit, outperforming the
image CRF baseline in almost all categories (+GIS rows in
Figure 8). It is interesting to note that labeling of some
categories that did not appear in the GIS map data (e.g. bench
and wall) is still improved significantly by the geometric
context provided in the model (wall is boosted from 0.348 to
0.481, presumably since the local appearance is similar to
building but the geometric context is not). This is in contrast
to, e.g., [44], where all labels were included in either the
GIS or detector-driven priors.

Scene Specific vs Generic Models Even without precisely
resectioning the test image, there is a significant gain in
accuracy from knowing the rough camera location. When the
camera location is completely unknown, the best we can do is
invoke the CRF trained on generic SUN data (0.309 accuracy).
However, if we know the camera is located somewhere on the
engineering quad, we can invoke the scene-specific CRF trained
on ENGQ to boost performance to 0.561. With resectioning, we
can further exploit the 3D model (+GIS) to gain an additional
3% in segmentation performance by using geosemantic context
features in the unary potential classifier. This gain is quite
significant given that some class baselines are already over
90% in accuracy, leaving little room for improvement.
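
The accuracies discussed above are per-class intersection-over-union scores. The following is a minimal sketch of that measure, assuming label maps are given as flat sequences of integer class indices. The function name and the toy label maps are hypothetical. In the PASCAL protocol the counts are accumulated over the whole test set, which here corresponds to passing the concatenated pixels of all test images.

```python
def per_class_iou(predicted, truth, num_classes):
    """Return the IoU of each class for two equal-length label sequences.

    A class that appears in neither sequence has an empty union and
    is reported as None rather than as a score.
    """
    if len(predicted) != len(truth):
        raise ValueError("label maps must have the same number of pixels")
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(predicted, truth):
        if p == t:
            # Pixel is in both the predicted and true region of class p.
            intersection[p] += 1
            union[p] += 1
        else:
            # Pixel is a false positive for p and a false negative for t.
            union[p] += 1
            union[t] += 1
    return [intersection[c] / union[c] if union[c] else None
            for c in range(num_classes)]


# Hypothetical flattened label maps; 0 = building, 1 = pavement, 2 = pedestrian.
truth = [0, 0, 1, 1, 1, 2]
predicted = [0, 1, 1, 1, 2, 2]
iou = per_class_iou(predicted, truth, num_classes=3)
```

Because the union grows with every false positive and false negative, a class that occupies few pixels, such as a sitting pedestrian, can score near zero even when most of the image is labeled correctly.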

7. Conclusion 

The rapid growth of digital mapping data in the form of GIS
databases offers a rich source of contextual information that
should be exploited in practical computer vision systems. We
have described a basic pipeline that allows for integration of
such data to guide both traditional geometric reconstruction
and semantic segmentation and recognition. With a small amount
of user supervision, we can quickly lift 2D GIS maps into 3D
models that immediately provide strong scene geometry estimates
(typically less than 5-10% relative depth error), greatly
outperforming existing approaches to monocular depth estimation
and providing a cheap alternative to laser range scanners. This
also provides strong geometric and semantic context features
that can be exploited to improve detection and segmentation.

Figure 9: Qualitative segmentation results with overlaid
images. Our combined model improves over an image-based CRF by
incorporating features derived from GIS (depth, labels,
normals) and a DPM detector. Legend classes: building, plants,
pavement, sky, pedestrian, ped. sitting, bicycle, bench, wall.